# White Wine Quality by Sandra Muñoz

## 'data.frame':    4898 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...

There are 4898 observations and 13 variables, but X is just a sequential index for each observation. There are 11 chemical variables:

1. fixed acidity (tartaric acid, g/dm^3)
2. volatile acidity (acetic acid, g/dm^3)
3. citric acid (g/dm^3)
4. residual sugar (g/dm^3)
5. chlorides (sodium chloride, g/dm^3)
6. free sulfur dioxide (mg/dm^3)
7. total sulfur dioxide (mg/dm^3)
8. density (g/cm^3)
9. pH
10. sulphates (potassium sulphate, g/dm^3)
11. alcohol (% by volume)

And 1 quality variable:

12. quality (score between 0 and 10)

As the table above shows, R reads the ‘quality’ variable as an integer, but in my opinion it should be treated as ordinal, because it ranks the wines from worst to best. So I am going to apply some transformations to the data frame:

##  Ord.factor w/ 7 levels "3"<"4"<"5"<"6"<..: 4 4 4 4 4 4 4 4 4 4 ...
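A minimal sketch of how such a conversion could be done in base R. The data frame name `wine` and the toy scores are assumptions for illustration; only the column names `quality` and `qualityCat` appear in the output above.

```r
# Toy stand-in for the wine data (the real data frame has 4898 rows);
# the name `wine` is an assumption.
wine <- data.frame(quality = c(6L, 5L, 7L, 3L, 9L, 6L, 8L, 4L))

# Store quality as an ordered factor so R treats it as ordinal.
# Levels are the sorted unique scores present in the data.
wine$qualityCat <- factor(wine$quality, ordered = TRUE)

str(wine$qualityCat)

# Ordered factors support comparisons between categories
wine$qualityCat[1] > wine$qualityCat[2]   # compares score 6 with score 5
```

Keeping the original integer column next to the ordered factor makes it possible to use `quality` in correlations and `qualityCat` for grouping.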

Statistical summary of the data:

##  fixed.acidity    volatile.acidity  citric.acid     residual.sugar  
##  Min.   : 3.800   Min.   :0.0800   Min.   :0.0000   Min.   : 0.600  
##  1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700   1st Qu.: 1.700  
##  Median : 6.800   Median :0.2600   Median :0.3200   Median : 5.200  
##  Mean   : 6.855   Mean   :0.2782   Mean   :0.3342   Mean   : 6.391  
##  3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900   3rd Qu.: 9.900  
##  Max.   :14.200   Max.   :1.1000   Max.   :1.6600   Max.   :65.800  
##                                                                     
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.00900   Min.   :  2.00      Min.   :  9.0       
##  1st Qu.:0.03600   1st Qu.: 23.00      1st Qu.:108.0       
##  Median :0.04300   Median : 34.00      Median :134.0       
##  Mean   :0.04577   Mean   : 35.31      Mean   :138.4       
##  3rd Qu.:0.05000   3rd Qu.: 46.00      3rd Qu.:167.0       
##  Max.   :0.34600   Max.   :289.00      Max.   :440.0       
##                                                            
##     density             pH          sulphates         alcohol     
##  Min.   :0.9871   Min.   :2.720   Min.   :0.2200   Min.   : 8.00  
##  1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100   1st Qu.: 9.50  
##  Median :0.9937   Median :3.180   Median :0.4700   Median :10.40  
##  Mean   :0.9940   Mean   :3.188   Mean   :0.4898   Mean   :10.51  
##  3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500   3rd Qu.:11.40  
##  Max.   :1.0390   Max.   :3.820   Max.   :1.0800   Max.   :14.20  
##                                                                   
##     quality      qualityCat
##  Min.   :3.000   3:  20    
##  1st Qu.:5.000   4: 163    
##  Median :6.000   5:1457    
##  Mean   :5.878   6:2198    
##  3rd Qu.:6.000   7: 880    
##  Max.   :9.000   8: 175    
##                  9:   5

It can be seen that no wine has the best quality (10) or the worst (0); the scores range from 3 to 9. The majority of the wines are in categories 5 and 6. So, with this dataset it is going to be difficult to draw conclusions about what makes a wine the best or the worst.

Univariate Plots Section

The first variables are related to acidity, so we are going to start by plotting them.

These four parameters look roughly normally distributed, but in all four cases there is some positive skew: there are few observations at the higher x-axis values.

So I am going to plot these variables again, excluding the top 1% of values.
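One way to drop the top 1% is to cut at the 99th percentile. This is only a sketch on simulated data; the vector `x` is a hypothetical stand-in for a skewed column such as volatile.acidity.

```r
set.seed(1)
# Toy right-skewed sample standing in for one of the acidity columns
x <- rlnorm(1000, meanlog = -1.3, sdlog = 0.3)

# Keep only values at or below the 99th percentile
cutoff  <- quantile(x, probs = 0.99)
trimmed <- x[x <= cutoff]

# Number of observations removed from the upper tail
length(x) - length(trimmed)
```

The trimmed vector can then be passed to the same plotting call as before, so only the long right tail disappears from the histogram.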

Now the normal distribution of these variables is seen clearly. But, for example, in ‘citric.acid’ there are some peaks.

Now, let’s plot the other concentration-related variables:

Again, all variables look normally distributed, but it appears better to exclude the top 1%.

Excluding the top 1%, it can be seen that residual.sugar appears to be log-normally distributed.

There is a bimodal distribution: one population is centered around low values and another around high values.
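A log-scaled histogram is one way to check for this kind of shape. The sketch below uses a simulated two-group mixture, not the real data; the vector `sugar` and its parameters are made up for illustration.

```r
set.seed(2)
# Toy mixture standing in for residual.sugar (g/dm^3):
# one group with low sugar and one with higher sugar
sugar <- c(rlnorm(600, meanlog = log(1.5), sdlog = 0.3),
           rlnorm(400, meanlog = log(10),  sdlog = 0.4))

# Histogram of log10 values; two separate humps point to a bimodal distribution
h <- hist(log10(sugar), breaks = 30,
          main = "log10(residual sugar)", xlab = "log10(g/dm^3)")
```

On the raw scale the same data would look like a single long-tailed hump, which is why the log transform helps here.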

Let’s plot the other variables:

Quality is normally distributed, with the majority of wines in the middle bins. Density also looks normal with some positive skew. On the other hand, alcohol looks multimodal.

Let’s see density and alcohol without the top 1%:

Density is normally distributed, but alcohol looks trimodal with low, medium and high alcohol content populations.

I am going to create a new variable that I think is interesting: residual.sugar / alcohol.

The peak at low values of sugar / alcohol is interesting.
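The ratio can be added as a new column. In this sketch the data frame name `wine` is an assumption; the four rows are the first values shown in the structure listing at the top, and `sugar_alcohol` is the column name that appears in the correlation table later on.

```r
# First four observations from the structure listing above
wine <- data.frame(
  residual.sugar = c(20.7, 1.6, 6.9, 8.5),   # g/dm^3
  alcohol        = c(8.8, 9.5, 10.1, 9.9)    # % by volume
)

# Ratio of residual sugar to alcohol for each wine
wine$sugar_alcohol <- wine$residual.sugar / wine$alcohol

round(wine$sugar_alcohol, 3)
```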

Univariate Analysis

The dataset has 4898 observations and 12 variables, not counting the index column X.

All 12 are relevant: 11 are chemical characteristics of the wine and one is the quality, a way to rank the wines from bad to good. But there is no wine with quality equal to 10 and none with quality equal to 0, and the majority of the wines have medium quality.

I created a new variable to see the ratio residual.sugar / alcohol.

Also, I transformed the quality variable into an ordered categorical type.

Bivariate Plots Section

First, let’s see the correlation among the different characteristics of the wine.

##                      fixed.acidity volatile.acidity  citric.acid
## fixed.acidity           1.00000000      -0.02269729  0.289180698
## volatile.acidity       -0.02269729       1.00000000 -0.149471811
## citric.acid             0.28918070      -0.14947181  1.000000000
## residual.sugar          0.08902070       0.06428606  0.094211624
## chlorides               0.02308564       0.07051157  0.114364448
## free.sulfur.dioxide    -0.04939586      -0.09701194  0.094077221
## total.sulfur.dioxide    0.09106976       0.08926050  0.121130798
## density                 0.26533101       0.02711385  0.149502571
## pH                     -0.42585829      -0.03191537 -0.163748211
## sulphates              -0.01714299      -0.03572815  0.062330940
## alcohol                -0.12088112       0.06771794 -0.075728730
## quality                -0.11366283      -0.19472297 -0.009209091
## sugar_alcohol           0.09363299       0.04575732  0.102730408
##                      residual.sugar   chlorides free.sulfur.dioxide
## fixed.acidity            0.08902070  0.02308564       -0.0493958591
## volatile.acidity         0.06428606  0.07051157       -0.0970119393
## citric.acid              0.09421162  0.11436445        0.0940772210
## residual.sugar           1.00000000  0.08868454        0.2990983537
## chlorides                0.08868454  1.00000000        0.1013923521
## free.sulfur.dioxide      0.29909835  0.10139235        1.0000000000
## total.sulfur.dioxide     0.40143931  0.19891030        0.6155009650
## density                  0.83896645  0.25721132        0.2942104109
## pH                      -0.19413345 -0.09043946       -0.0006177961
## sulphates               -0.02666437  0.01676288        0.0592172458
## alcohol                 -0.45063122 -0.36018871       -0.2501039415
## quality                 -0.09757683 -0.20993441        0.0081580671
## sugar_alcohol            0.99001187  0.11932114        0.3143238443
##                      total.sulfur.dioxide     density            pH
## fixed.acidity                 0.091069756  0.26533101 -0.4258582910
## volatile.acidity              0.089260504  0.02711385 -0.0319153683
## citric.acid                   0.121130798  0.14950257 -0.1637482114
## residual.sugar                0.401439311  0.83896645 -0.1941334540
## chlorides                     0.198910300  0.25721132 -0.0904394560
## free.sulfur.dioxide           0.615500965  0.29421041 -0.0006177961
## total.sulfur.dioxide          1.000000000  0.52988132  0.0023209718
## density                       0.529881324  1.00000000 -0.0935914935
## pH                            0.002320972 -0.09359149  1.0000000000
## sulphates                     0.134562367  0.07449315  0.1559514973
## alcohol                      -0.448892102 -0.78013762  0.1214320987
## quality                      -0.174737218 -0.30712331  0.0994272457
## sugar_alcohol                 0.429487399  0.87168339 -0.2013195265
##                        sulphates     alcohol      quality sugar_alcohol
## fixed.acidity        -0.01714299 -0.12088112 -0.113662831    0.09363299
## volatile.acidity     -0.03572815  0.06771794 -0.194722969    0.04575732
## citric.acid           0.06233094 -0.07572873 -0.009209091    0.10273041
## residual.sugar       -0.02666437 -0.45063122 -0.097576829    0.99001187
## chlorides             0.01676288 -0.36018871 -0.209934411    0.11932114
## free.sulfur.dioxide   0.05921725 -0.25010394  0.008158067    0.31432384
## total.sulfur.dioxide  0.13456237 -0.44889210 -0.174737218    0.42948740
## density               0.07449315 -0.78013762 -0.307123313    0.87168339
## pH                    0.15595150  0.12143210  0.099427246   -0.20131953
## sulphates             1.00000000 -0.01743277  0.053677877   -0.01803066
## alcohol              -0.01743277  1.00000000  0.435574715   -0.53683146
## quality               0.05367788  0.43557472  1.000000000   -0.13475048
## sugar_alcohol        -0.01803066 -0.53683146 -0.134750485    1.00000000
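A table like the one above can be produced with `cor()` on the numeric columns. The sketch below runs on simulated data with made-up coefficients, so the numbers will not match the real ones; the data frame name `wine` is an assumption.

```r
set.seed(3)
n <- 200
alcohol        <- runif(n, 8, 14)
residual.sugar <- rlnorm(n, meanlog = log(5), sdlog = 0.8)
# Toy density: rises with sugar and falls with alcohol (made-up coefficients)
density <- 0.995 + 0.0004 * residual.sugar - 0.001 * alcohol +
  rnorm(n, sd = 0.0005)
wine <- data.frame(X = seq_len(n), alcohol, residual.sugar, density)

# Use numeric columns only and drop the row index X
num_cols <- setdiff(names(wine)[sapply(wine, is.numeric)], "X")

# cor() computes Pearson correlations by default
cor_mat <- cor(wine[num_cols])
round(cor_mat, 2)
```

Rounding to two decimals makes the matrix much easier to scan than the full eight-digit output.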

Let’s plot the pairs of variables with higher correlation:

1. Residual_sugar and Density

There is a clear positive correlation between these two variables (0.839). When sugar content increases, density increases too.

2. Alcohol and Density

In this case, the correlation is negative (-0.78).

3. Sugar_alcohol and Density

These variables have a strong positive correlation (0.872).

Let’s explore the correlation between quality and some parameters, because it would be good to see whether some characteristics determine if a wine is good or not.

4. Chlorides and alcohol

There is a weak negative correlation (-0.36).

5. Quality and alcohol

From quality 5 upward, the median alcohol content starts to increase. So it looks like wines with better quality tend to have more alcohol.
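The medians behind a comparison like this can be computed per quality level with `tapply()`. The values below are invented to show the mechanics and are not taken from the dataset; the data frame name `wine` is an assumption.

```r
# Made-up observations for illustration only
wine <- data.frame(
  quality = c(4, 4, 5, 5, 5, 6, 6, 7, 7, 8),
  alcohol = c(10.1, 10.5, 9.4, 9.6, 9.8, 10.4, 10.6, 11.2, 11.6, 12.0)
)

# Median alcohol content within each quality level
med <- tapply(wine$alcohol, wine$quality, median)
med
```

Calling `boxplot(alcohol ~ quality, data = wine)` on the same data frame would show these medians as the center lines of the boxes.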

6. Quality and Sugar_alcohol

It is not easy to draw conclusions about this relationship because the median values move up and down.

7. Quality and Density

It can be observed that wines with higher quality have lower density.

8. Quality and Chlorides

It looks like wines with higher quality have lower values of chlorides, but the decrease is very slight. Also, there are lots of outliers in wines with quality 5 and 6.

Bivariate Analysis

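It looks like density and chlorides decrease in better wines, while alcohol increases.

The strongest relationship is between density and the sugar/alcohol ratio (0.87), leaving aside the 0.99 between residual sugar and the ratio, which is expected because the ratio is derived from it.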

Multivariate Plots Section

First, as there is a strong relation between density and alcohol, let’s plot them together with quality.

9. Alcohol, density and quality

It can be seen that higher quality wines tend to have high alcohol levels and low density.

10. Sugar_alcohol, density and quality

For a given sugar / alcohol value, better wines appear to have lower density than worse ones.

11. Residual_sugar, density and quality

In general, better wines have lower density for a given sugar value.

12. Alcohol, chlorides and quality

There is no clear relation here.

Multivariate Analysis

Density looks like an important feature in wines, and so does the level of alcohol. The relation between sugar and density is also important when deciding the quality of a wine.


Final Plots and Summary

Plot One

Description One

In this graph we see the clear positive correlation between sugar/alcohol and density. Better wines are below the trend line, but more dark spots can be seen in the area of lower sugar/alcohol levels.
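Whether better wines sit below the trend line can be checked numerically by fitting a straight line and averaging the residuals per quality level. This is a sketch on simulated data: the linear fit with `lm()` is an assumption about how the trend line is defined, and the coefficients are made up.

```r
set.seed(4)
n <- 300
quality       <- sample(4:8, n, replace = TRUE)
sugar_alcohol <- runif(n, 0.1, 2)
# Toy density: linear in the ratio, shifted slightly down for better wines
density <- 0.990 + 0.004 * sugar_alcohol - 0.0005 * (quality - 6) +
  rnorm(n, sd = 0.0005)
wine <- data.frame(quality, sugar_alcohol, density)

# Straight trend line of density against the sugar/alcohol ratio
fit <- lm(density ~ sugar_alcohol, data = wine)

# Negative residuals are points below the line
wine$resid <- residuals(fit)
mean_resid <- tapply(wine$resid, wine$quality, mean)
round(mean_resid, 4)
```

If the mean residual falls as quality rises, the better wines are on average below the line, which is the pattern described above.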

Plot Two

Description Two

In my opinion this plot is interesting because it reflects the idea that better wines have more alcohol. It is curious that among the low-quality wines more alcohol does not go with a better score, but from medium quality (5) onward the median alcohol content starts to increase.

Plot Three

Description Three

This plot is interesting because it shows that better wines tend to have higher alcohol content and lower density. It is curious that there are some dark spots in the left area of the plot, so there are some good wines with little alcohol but high density.


Reflection

This dataset has good information to get an idea of what makes a wine bad or good, but it contains many variables and some of them are likely related, like the acidity ones, so it is difficult to draw conclusions.

The strongest positive correlation is between density and the sugar/alcohol ratio (0.87), while the strongest negative correlation is between density and alcohol (-0.78).
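The ranking of pairs can be read off the correlation matrix programmatically. The sketch below copies four rows and columns from the correlation table above, rounded to three decimals; the object names `cor_mat` and `pair_tab` are assumptions.

```r
vars <- c("residual.sugar", "density", "alcohol", "sugar_alcohol")
# Values rounded from the correlation table in the bivariate section
cor_mat <- matrix(c(
   1.000,  0.839, -0.451,  0.990,
   0.839,  1.000, -0.780,  0.872,
  -0.451, -0.780,  1.000, -0.537,
   0.990,  0.872, -0.537,  1.000
), nrow = 4, byrow = TRUE, dimnames = list(vars, vars))

# Use the upper triangle so that each pair is listed once
idx <- which(upper.tri(cor_mat), arr.ind = TRUE)
pair_tab <- data.frame(
  var1 = rownames(cor_mat)[idx[, "row"]],
  var2 = colnames(cor_mat)[idx[, "col"]],
  r    = cor_mat[idx]
)

# Sort by absolute correlation, strongest first
sorted <- pair_tab[order(-abs(pair_tab$r)), ]
sorted
```

The top row is residual.sugar against sugar_alcohol, which is expected because the ratio is built from residual sugar, so the first informative pairs are the ones involving density.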

For me, the most interesting insight is that better wines, in general, have a high alcohol content and less sugar, so the best wines are not so sweet.

With more time, it could be analysed whether some combinations of 3 or more features make a wine special. There are a lot of wines of medium quality but only 5 with quality 9 and none with 10. It would be good to see why.